Skip to content

Conversation

@nkomonen-amazon
Copy link
Contributor

@nkomonen-amazon nkomonen-amazon commented Oct 15, 2024

Problem:

When a user does a sleep then wake of their computer, the heartbeat is not up to date since it cannot be sent when the user's computer is asleep.

Due to this there is a race condition on wake between the next fresh heartbeat being sent versus when we check the heartbeats to determine if they are stale (crash). If the crash checker runs before a new heartbeat can be sent, it will be seen as a crash.

Solution:

Use a TimeLag class that helps to determine when there is a time discrepancy. It works by updating a state every second, and if we determine that the next update to that state took longer than a second, we determine that there was a lag. Then we simply skip the next crash check, allowing a fresh heartbeat to be sent.


License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

@nkomonen-amazon nkomonen-amazon requested a review from a team as a code owner October 15, 2024 19:00
@nkomonen-amazon
Copy link
Contributor Author

/runIntegrationTests

@github-actions
Copy link

This pull request modifies code in src/ but no tests were added/updated. Confirm whether tests should be added or ensure the PR description explains why tests are not required.

@github-actions
Copy link

This pull request implements a feature or fix, so it must include a changelog entry. See CONTRIBUTING.md#changelog for instructions.

@nkomonen-amazon nkomonen-amazon changed the title fix(crashMonitoring): handle sleep/wake appropriately fix(crash): handle sleep/wake appropriately Oct 15, 2024
* Solution: Keep track of the lag, and then skip the next crash check if there was a lag. This will give time for the
* next heartbeat to be sent.
*/
class TimeLag {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this class abstracting? why is this not part of the existing Heartbeat class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlying issue is checking for a stale heartbeat. We cannot reliably compensate for lag by sending a heartbeat since the crash checking is a separate process, and there becomes a race condition between sending a fresh heartbeat vs checking for a stale one. The solution is that the crash checker compensates for "time lag" by skipping a check

This class could live inside the CrashChecker class, but I'm currently passing the TimeLag instance in as a dependency to it.

Comment on lines 383 to 388
private isCompleted: Promise<void> | undefined
/** Resolves {@link isCompleted} when the next interval of lag checking is completed */
private setCompleted: (() => void) | undefined

/** The last timestamp we remember. If this is more than 1 {@link lagCheckInterval} we probably did a sleep+wake */
private latestTimestamp: number = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these 3 properities are common to the existing Timeout class in our timoutUtils module. does using Timeout save any code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created this SIM as there is some extra work since we need an interval instead of a timeout: IDE-15104

* Heartbeats that indicate the extension instance is still running.
* {@link CrashChecker} listens for these.
*/
class Heartbeat {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hearbeat is a concept that continues to come up, e.g. we have another heartbeat for codecatalyst:

private tryStartActivityHeartbeat = shared(async (reauth: boolean) => {

Even if the codecatalyst heartbeat is too different to be useful here, is this new Heartbeat not amenable to generalization? It is a general concept that is needed by different features that want to emit data on a timed basis.

Do we have a backlog item to pull this out / generalize it? Are features easily able to attach data to a toolkit_heartbeat metric, or did we name the metric crashreporting_x (not intuitive)?

Copy link
Contributor Author

@nkomonen-amazon nkomonen-amazon Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't much reuse between Codecatalyst heartbeat and this heartbeat atm. This one heartbeats on an interval, and the other heartbeats when the user performs actions in the IDE.

We could make an interface for the common apis, but this doesn't deduplicate any existing code.

I'll made a backlog item for unifying the heartbeats: https://issues.amazon.com/issues/IDE-15108


Are features easily able to attach data to a toolkit_heartbeat metric, or did we name the metric crashreporting_x (not intuitive)?

We dont have a metric sent for heartbeats since it is a bit noisy.

Problem:

When a user does a sleep then wake, our heartbeats are not up to date since
they cannot send when the user's computer is asleep.

Due to this there is a race condition between the heartbeats sent and when we check the
heartbeats to determine if they are stale (crash).

Solution:

Keep a TimeLag class that helps to assert that there is no lag in our time.
It works by updating a state every second. And if we determine that the next
update to that state took longer than a second, we determine that there was a lag.
Then we simply skip the next crash check, allowing a fresh heartbeat to be sent.

Signed-off-by: nkomonen-amazon <[email protected]>
There was some excessive code due to an older design. This removes
the use of nested folders in favor of using the root folder to store
state. This simplifies the code.

Signed-off-by: nkomonen-amazon <[email protected]>
Signed-off-by: nkomonen-amazon <[email protected]>
@justinmk3
Copy link
Contributor

Approved in case this is needed for the release.

Although CC heartbeat doesn't have much overlap, and currently there is no other consumer, we know that there are multiple future use-cases for attaching data to a global heartbeat (and this would have eliminated some existing metrics), so it's a good idea to start thinking about lifting this out of crash-reporting. Especially if a metric [will be] involved.

Possibly the UserActivity concept is related.

Signed-off-by: nkomonen-amazon <[email protected]>
@nkomonen-amazon nkomonen-amazon merged commit 449de46 into aws:master Oct 17, 2024
24 checks passed
@nkomonen-amazon nkomonen-amazon deleted the crashSleepFix branch October 17, 2024 03:33
tverney pushed a commit to tverney/aws-toolkit-vscode that referenced this pull request Oct 21, 2024
## Problem:

When a user does a sleep then wake of their computer, the heartbeat is
not up to date since it cannot be sent when the user's computer is
asleep.

Due to this there is a race condition on wake between the next fresh
heartbeat being sent versus when we check the heartbeats to determine if
they are stale (crash). If the crash checker runs before a new heartbeat
can be sent, it will be seen as a crash.

## Solution:

Use a TimeLag class that helps to determine when there is a time
discrepancy. It works by updating a state every second, and if we
determine that the next update to that state took longer than a second,
we determine that there was a lag. Then we simply skip the next crash
check, allowing a fresh heartbeat to be sent.

---

<!--- REMINDER: Ensure that your PR meets the guidelines in
CONTRIBUTING.md -->

License: I confirm that my contribution is made under the terms of the
Apache 2.0 license.

---------

Signed-off-by: nkomonen-amazon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants